Chapter 1: Introduction

Most of people care about the price of residential properties. Washington D.C. is the capital of the US. It is interesting to take a view at the price of DC residential properties. Intuitively, DC is a city that has faced one of the highest price level in the US. Generally, in 2017, the population in DC was nearly 700,000 and the average price of a single family was approximately 649,000. In this report, we aim to find what factors influence the price significantly.

Chapter 2: Data Description

2.1 Data Source

Link: https://www.kaggle.com/christophercorrea/dc-residential-properties
The main source of our data is a csv file in kaggle, which is the 7th version updated in July 2018.
Link: https://data.worldbank.org/indicator/fp.cpi.totl.zg
Another dataset about inflation is a csv file in World Bank Open Data, which is updated in January 2019.

2.2 Dataset

This dataset has 49 variables including price. Apart from price, these variables can be divided into four groups: attributes of house itself, time related factors, assessment of third party, and location. For the full name of these variables, these infomation is introduced in detail in https://www.kaggle.com/christophercorrea/dc-residential-properties.

#str(rawrp)
names(rawrp)
##  [1] "X.1"                "BATHRM"             "HF_BATHRM"         
##  [4] "HEAT"               "AC"                 "NUM_UNITS"         
##  [7] "ROOMS"              "BEDRM"              "AYB"               
## [10] "YR_RMDL"            "EYB"                "STORIES"           
## [13] "SALEDATE"           "PRICE"              "QUALIFIED"         
## [16] "SALE_NUM"           "GBA"                "BLDG_NUM"          
## [19] "STYLE"              "STRUCT"             "GRADE"             
## [22] "CNDTN"              "EXTWALL"            "ROOF"              
## [25] "INTWALL"            "KITCHENS"           "FIREPLACES"        
## [28] "USECODE"            "LANDAREA"           "GIS_LAST_MOD_DTTM" 
## [31] "SOURCE"             "CMPLX_NUM"          "LIVING_GBA"        
## [34] "FULLADDRESS"        "CITY"               "STATE"             
## [37] "ZIPCODE"            "NATIONALGRID"       "LATITUDE"          
## [40] "LONGITUDE"          "ASSESSMENT_NBHD"    "ASSESSMENT_SUBNBHD"
## [43] "CENSUS_TRACT"       "CENSUS_BLOCK"       "WARD"              
## [46] "SQUARE"             "X"                  "Y"                 
## [49] "QUADRANT"

2.3 Data Preprocessing and Cleaning

For the time part, we convert the date format (dd/mm/yyyy) into year format (yyyy) and then calculate the time span. For example, the time span of the building (SPAN_AYB) is from the year the structure was first built to the last date of sale. Similar for the time span of the year remodeled (SPAN_YR_RMDL) and the time span of an improvement was built (SPAN_EYB). Then, we can use these time spans numerically. We also set a new variable that is price per square foot, which is a common-use measurement.

2.3.1 Cleaning Redundant Variables

There are so many redundant variables, such as latitude & longitude and X & Y. In this report, we ignore nearly half of the variables here.

prerp <- rawrp[-c(1,18,28,30:36,38:40,44,46:48)]
prerp$SALE_YR <- as.integer(format(as.Date(prerp$SALEDATE),'%Y'))
prerp$SPAN_YR_RMDL <- prerp$SALE_YR - prerp$YR_RMDL
prerp$SPAN_AYB <- prerp$SALE_YR - prerp$AYB
prerp$SPAN_EYB <- prerp$SALE_YR - prerp$EYB
#names(prerp)

prerp <- prerp[-c(8:10,12)]

prerp$BATHRM <- prerp$BATHRM + 0.5*prerp$HF_BATHRM
prerp <- prerp[-c(2)]

prerp$SALE_NUM <- as.factor(prerp$SALE_NUM)


prerp$PRICE_GBA <- prerp$PRICE/prerp$GBA

names(prerp)
##  [1] "BATHRM"             "HEAT"               "AC"                
##  [4] "NUM_UNITS"          "ROOMS"              "BEDRM"             
##  [7] "STORIES"            "PRICE"              "QUALIFIED"         
## [10] "SALE_NUM"           "GBA"                "STYLE"             
## [13] "STRUCT"             "GRADE"              "CNDTN"             
## [16] "EXTWALL"            "ROOF"               "INTWALL"           
## [19] "KITCHENS"           "FIREPLACES"         "LANDAREA"          
## [22] "ZIPCODE"            "ASSESSMENT_NBHD"    "ASSESSMENT_SUBNBHD"
## [25] "CENSUS_TRACT"       "WARD"               "QUADRANT"          
## [28] "SALE_YR"            "SPAN_YR_RMDL"       "SPAN_AYB"          
## [31] "SPAN_EYB"           "PRICE_GBA"

2.3.2 Omit NA

Following, we only select the complete data to use. One thing should be noted, the factor variable air conditioner (AC) has three levels: 0, N, Y. Hereby, “0” means the data is not available, which is the same as NA. As a result, we narrow the dataset down from 150k rows to 33k rows.

nrow(prerp)
## [1] 158957
levels(prerp[,3])[1] <- NA #AC = "0" is NA
prerp <- na.omit(prerp)
#prerp <- prerp[!is.na(prerp$PRICE_GBA),]
nrow(prerp)
## [1] 33162

2.3.3 How to remove outliers? OutlierKD function?

We have finished the basic cleaning, but it is still a problem to see some extreme outliers. If we trivially use the outlierKD function w.r.t price to remove the outliers, we will find the upperbound of the housing price is only 1,000,000 USD. From the common knowledge, we know the average housing price in DC is over 600k USD and there are so many properties sold over 1,000,000 USD. Additonally, from the plot of price or price per square foot (scrolling down to see), these data is highly skewed to the right. Thus, using outlierKD function may remove too many useful data on the right and too little on the left.

Therefore we use this following method to remove more outliers on the left than the right, with 5% and 0.1% respectively. Although this approach is not very accurate and well-proved, it is still reasonable comparing to outlierKD function.

p_lowerbound <- sort(prerp$PRICE_GBA)[nrow(prerp)*0.05] # impossible PRICE_GBA <= 80
p_upperbound <- sort(prerp$PRICE_GBA)[nrow(prerp)*0.999] # still possible >= 1000
r_lowerbound <- sort(prerp$ROOMS)[nrow(prerp)*0.001]
r_upperbound <- sort(prerp$ROOMS)[nrow(prerp)*0.999]
rp <- subset(prerp, PRICE_GBA >= p_lowerbound & PRICE_GBA <= p_upperbound & ROOMS >= r_lowerbound & ROOMS <= r_upperbound & KITCHENS <= 4)

rp_num <- data.frame(sapply(rp, as.numeric)) # for correlation part

There are the extreme cases if we do not clean the outliers. One impossible instance is price is only 1 USD with 7 rooms. Besides, we also do not care about the luxury properties in this report.

price_extreme <- subset(prerp, PRICE_GBA <= p_lowerbound | PRICE_GBA >= p_upperbound)
subset(price_extreme, PRICE_GBA <= 1 | PRICE_GBA >= 3000) #very extreme cases
##        BATHRM          HEAT AC NUM_UNITS ROOMS BEDRM STORIES    PRICE
## 3331      2.5 Hot Water Rad  N         1     9     4       2      936
## 4200      1.5 Hot Water Rad  Y         1     6     3       2     1377
## 18569     2.5 Hot Water Rad  N         2     8     3       2       10
## 28637     2.5    Forced Air  Y         1     5     0       1  3675000
## 42070     3.5    Forced Air  Y         3     8     3       3      250
## 48415     2.0    Forced Air  N         1     8     5       3      250
## 49322     2.0    Forced Air  Y         1     9     6       3      250
## 54230     4.0    Forced Air  Y         2     8     4       3      500
## 54910     6.0    Forced Air  Y         2    10     6       3      250
## 55049     3.0 Hot Water Rad  Y         1     9     5       2      250
## 81695     1.5 Hot Water Rad  N         1     7     3       2        1
## 84166     2.0 Hot Water Rad  Y         2     8     2       2      250
## 90982     4.0    Forced Air  Y         4    16     4       2 11000000
## 90983     4.0    Forced Air  Y         4    16     4       2 11000000
## 100486    4.0 Hot Water Rad  Y         4    16     4       2     1000
## 102783    1.0    Air Exchng  N         1     5     2       2      250
## 106684    3.0    Forced Air  N         3    14     6       3        1
##        QUALIFIED SALE_NUM  GBA   STYLE        STRUCT         GRADE
## 3331           U        2 2142 2 Story    Row Inside  Good Quality
## 4200           U        1 1424 2 Story    Row Inside  Good Quality
## 18569          U        2 2272 2 Story         Multi      Superior
## 28637          Q        2  994 1 Story        Single      Superior
## 42070          U        1 3179 3 Story    Row Inside Above Average
## 48415          Q        1 2281 3 Story    Row Inside  Good Quality
## 49322          U        1 2990 3 Story    Row Inside  Good Quality
## 54230          U        1 2421 3 Story    Row Inside  Good Quality
## 54910          U        1 2596 3 Story    Row Inside  Good Quality
## 55049          U        1 1462 2 Story       Row End  Good Quality
## 81695          U        1 1420 2 Story        Single Above Average
## 84166          U        1 1298 2 Story    Row Inside       Average
## 90982          U        2 3440 2 Story         Multi       Average
## 90983          U        2 3400 2 Story         Multi       Average
## 100486         U        1 2736 2 Story         Multi       Average
## 102783         U        1 1132 2 Story Semi-Detached       Average
## 106684         U        1 2150 3 Story         Multi       Average
##            CNDTN      EXTWALL         ROOF       INTWALL KITCHENS
## 3331     Average Common Brick   Metal- Sms      Hardwood        1
## 4200     Average Common Brick     Built Up      Hardwood        1
## 18569    Average Common Brick     Built Up      Hardwood        2
## 28637  Very Good       Stucco        Slate      Hardwood        1
## 42070       Good Common Brick   Metal- Sms      Hardwood        3
## 48415    Average Common Brick   Metal- Sms    Wood Floor        1
## 49322    Average Common Brick   Metal- Sms      Hardwood        1
## 54230    Average Common Brick     Built Up      Hardwood        2
## 54910    Average Common Brick     Built Up Hardwood/Carp        2
## 55049       Good Common Brick     Built Up      Hardwood        1
## 81695    Average Common Brick Comp Shingle    Wood Floor        1
## 84166    Average Common Brick   Metal- Sms      Hardwood        2
## 90982       Good Common Brick     Built Up        Carpet        4
## 90983       Good Common Brick     Built Up        Carpet        4
## 100486   Average Common Brick   Metal- Sms    Wood Floor        4
## 102783   Average Brick Veneer   Metal- Sms    Wood Floor        1
## 106684   Average Common Brick     Built Up      Hardwood        3
##        FIREPLACES LANDAREA ZIPCODE  ASSESSMENT_NBHD     ASSESSMENT_SUBNBHD
## 3331            3     1398   20001       Old City 2       040 B Old City 2
## 4200            5     1360   20009       Old City 2       040 E Old City 2
## 18569           0     2400   20007       Georgetown       025 F Georgetown
## 28637           0    13940   20016   Wesley Heights   054 A Wesley Heights
## 42070           3     2700   20009 Columbia Heights 015 A Columbia Heights
## 48415           0     2417   20010 Columbia Heights 015 E Columbia Heights
## 49322           0     2853   20009 Columbia Heights 015 E Columbia Heights
## 54230           0     1170   20001     Ledroit Park     031 A Ledroit Park
## 54910           0     2520   20001     Ledroit Park     031 B Ledroit Park
## 55049           1     1152   20001     Ledroit Park     031 B Ledroit Park
## 81695           1     4400   20018    Michigan Park                       
## 84166           0     1040   20002       Old City 1       039 G Old City 1
## 90982           0     4116   20019 Fort Dupont Park 022 A Fort Dupont Park
## 90983           0     3504   20019 Fort Dupont Park 022 A Fort Dupont Park
## 100486          0     4224   20020        Anacostia        002 A Anacostia
## 102783          0     2216   20020        Anacostia        002 A Anacostia
## 106684          0     7811   20032 Congress Heights 016 A Congress Heights
##        CENSUS_TRACT   WARD QUADRANT SALE_YR SPAN_YR_RMDL SPAN_AYB SPAN_EYB
## 3331           4801 Ward 6       NW    1996            0       96       27
## 4200           4400 Ward 1       NW    2000           -4      100       40
## 18569           100 Ward 2       NW    2018            9      128       42
## 28637           801 Ward 3       NW    2016           13       10        4
## 42070          3700 Ward 1       NW    1999           -5       94       32
## 48415          3000 Ward 1       NW    1999           11       91       39
## 49322          3600 Ward 1       NW    1999           -2       89       39
## 54230          3400 Ward 1       NW    1999           -2       99       30
## 54910          3301 Ward 5       NW    1999          -13       93       24
## 55049          3301 Ward 5       NW    1999           -4       90       36
## 81695          9400 Ward 5       NE    2000           30       71       53
## 84166          7901 Ward 6       NE    1998           -1       89       37
## 90982          7703 Ward 7       SE    2017           15       74       53
## 90983          7703 Ward 7       SE    2017           15       74       53
## 100486         7504 Ward 8       SE    1999           -6       62       45
## 102783         7503 Ward 8       SE    1998           -5       93       44
## 106684        10900 Ward 8       SW    2000            1       47       29
##         PRICE_GBA
## 3331      0.43697
## 4200      0.96699
## 18569     0.00440
## 28637  3697.18310
## 42070     0.07864
## 48415     0.10960
## 49322     0.08361
## 54230     0.20653
## 54910     0.09630
## 55049     0.17100
## 81695     0.00070
## 84166     0.19260
## 90982  3197.67442
## 90983  3235.29412
## 100486    0.36550
## 102783    0.22085
## 106684    0.00047

We can compare the differences of the distribution between the within outliers and without outliers.
Within outliers:

boxplot(prerp$PRICE, ylab='PRICE')

boxplot(prerp$PRICE_GBA, ylab='PRICE_GBA')

Without outliers:

boxplot(rp$PRICE,ylab='PRICE')

boxplot(rp$PRICE_GBA,ylab='PRICE_GBA')

2.3.5 Cleaning outliers again

From the plot above, although the plot is more reasonable after the first outliers cleaning, we still want to make the distribution looks like normal distribution. At this step, we call the outlierKD function and it removes the outliers with the upperbound over 1,600,000 USD. Also, it is better to do statistical inference with the normal distributed data.

outlierKD(rp, PRICE)

## Outliers identified: 1633 nPropotion (%) of outliers: 5.5 nMean of the outliers: 2461816 nMean without removing outliers: 707479 nMean if we remove outliers: 611350 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n
rp <- na.omit(rp)

2.4 Correlation

Now we try to draw a (Pearson) correlation matrix among these variables. Before that, we assume these variables are numeric.
It is obvious to see that price is highly positive related to the gross building area and some other attributes about rooms, which is consistent with our prior knowledge.

rp_corr <- cor(rp_num)
corrplot(rp_corr, method = "square")

2.5 Summary of Statistics

After clearning all the data, we take a view at the basic statistics about this new dataset we will use in the report. In this dataset, the mean price is 611,350 USD, which is closed to value calculated by other authorities. The mean area is 1668 square feet. Besides, the median price and area is a little bit lower than the mean values. As for the deviation, it is quite large that the standard deviations are approximately half of the mean values both for price and area.

stat_rp <- stat.desc(rp)
show_stat <- stat_rp[sapply(stat_rp, is.numeric)]
show_stat
##                  BATHRM  NUM_UNITS      ROOMS       BEDRM   STORIES
## nbr.val      31435.0000 31435.0000  31435.000  31435.0000 31435.000
## nbr.null         1.0000    24.0000      0.000      4.0000     6.000
## nbr.na           0.0000     0.0000      0.000      0.0000     0.000
## min              0.0000     0.0000      3.000      0.0000     0.000
## max             11.0000     5.0000     20.000     15.0000   826.000
## range           11.0000     5.0000     17.000     15.0000   826.000
## sum          85760.0000 38175.0000 236437.000 110517.0000 67683.800
## median           2.5000     1.0000      7.000      3.0000     2.000
## mean             2.7282     1.2144      7.521      3.5157     2.153
## SE.mean          0.0063     0.0032      0.012      0.0064     0.029
## CI.mean.0.95     0.0123     0.0062      0.024      0.0126     0.057
## var              1.2388     0.3136      4.861      1.2956    26.135
## std.dev          1.1130     0.5600      2.205      1.1383     5.112
## coef.var         0.4080     0.4611      0.293      0.3238     2.374
##                        PRICE         GBA   KITCHENS FIREPLACES LANDAREA
## nbr.val             31435.00    31435.00 31435.0000 31435.0000    31435
## nbr.null                0.00        0.00    15.0000 15071.0000        0
## nbr.na                  0.00        0.00     0.0000     0.0000        0
## min                 45000.00      407.00     0.0000     0.0000      216
## max              11984000.00    11810.00     4.0000    13.0000   102340
## range            11939000.00    11403.00     4.0000    13.0000   102124
## sum           22239609301.00 55733190.00 39559.0000 24555.0000 99000459
## median             594900.00     1554.00     1.0000     1.0000     2066
## mean               707479.22     1772.97     1.2584     0.7811     3149
## SE.mean              3275.94        4.76     0.0033     0.0056       18
## CI.mean.0.95         6420.96        9.33     0.0064     0.0109       35
## var          337352735177.53   711879.37     0.3379     0.9774  9999155
## std.dev            580820.74      843.73     0.5813     0.9886     3162
## coef.var                0.82        0.48     0.4619     1.2656        1
##                      ZIPCODE CENSUS_TRACT      SALE_YR SPAN_YR_RMDL
## nbr.val          31435.00000     31435.00    31435.000    31435.000
## nbr.null             0.00000         0.00        0.000     7681.000
## nbr.na               0.00000         0.00        0.000        0.000
## min              20001.00000       100.00     1991.000      -26.000
## max              20052.00000     11100.00     2018.000     1995.000
## range               51.00000     11000.00       27.000     2021.000
## sum          629019390.00000 160893142.00 63198295.000   184358.000
## median           20010.00000      4802.00     2012.000        1.000
## mean             20010.16033      5118.28     2010.444        5.865
## SE.mean              0.04162        19.28        0.034        0.091
## CI.mean.0.95         0.08157        37.79        0.067        0.179
## var                 54.44258  11686239.73       36.996      262.915
## std.dev              7.37852      3418.51        6.082       16.215
## coef.var             0.00037         0.67        0.003        2.765
##                SPAN_AYB    SPAN_EYB  PRICE_GBA
## nbr.val        31435.00   31435.000    31435.0
## nbr.null          19.00      27.000        0.0
## nbr.na             0.00       0.000        0.0
## min              -14.00     -16.000       88.9
## max              262.00      92.000     1445.1
## range            276.00     108.000     1356.2
## sum          2715567.00 1278360.000 12396662.8
## median            88.00      43.000      374.7
## mean              86.39      40.667      394.4
## SE.mean            0.13       0.064        1.1
## CI.mean.0.95       0.25       0.126        2.2
## var              518.61     129.247    38275.6
## std.dev           22.77      11.369      195.6
## coef.var           0.26       0.280        0.5

2.6 Find the best model

2.6.1 Numeric model

For the numeric dataset, we generate the best linear model for price with 12 maximal number of variables. We have used three information criteria but all of these criteria have the same result. Thus, we only show the plot of “Adjusted R^2”. Without considering some variables are factor, the most significant variables are gross building area (GBA), date of sale (YR_SALE), number of fireplaces (FIREPLACES), time span of building (SPAN_AYB), time span of an improvement (SPAN_EYB), number of units (NUM_UNITS), heating system (HEAT), qualification (QUALIFIED), and population of a particular district (CENSUS_TRACT). i.e. PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES + QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB.

rp_num_best <- regsubsets(PRICE~. -PRICE_GBA-ZIPCODE, data = rp_num, method = "seqrep", nvmax = 12)
plot(rp_num_best, scale = "adjr2", main = "Adjusted R^2")

# In the "leaps" package, we can use scale=c("bic","Cp","adjr2","r2")
#plot(rp_num_best, scale = "bic", main = "BIC")
#plot(rp_num_best, scale = "Cp", main = "Cp")

Similar for the best linear model for price per area, we obtain the model with different variables: PRICE_GBA ~ BEDRM (number of bedrooms) + QUALIFIED + GRADE (assessment by other authorities) + CNDTN (condition assessed by public) + FIREPLACES + WARD (eight different districts divided in DC) + SALE_YR + SPAN_AYB + SPAN_EYB. Acctually, we prefer the model w.r.t price per area and we will prove it in the analysis of the later SMART questions.

rp_num_best <- regsubsets(PRICE_GBA~.-PRICE-GBA-ZIPCODE, data = rp_num, method = "seqrep", nvmax = 12)
plot(rp_num_best, scale = "adjr2", main = "Adjusted R^2")

# In the "leaps" package, we can use scale=c("bic","Cp","adjr2","r2")
#plot(rp_num_best, scale = "bic", main = "BIC")
#plot(rp_num_best, scale = "Cp", main = "Cp")

Recall the linear regression model for price and for price per area. All the coefficents of variables are very significant and the regression model fits the data moderately well (with adj. R^2 = 0.693 and 0.567 respectively).

reg_rp_num <- lm(PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES + QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
summary(reg_rp_num)
## 
## Call:
## lm(formula = PRICE ~ BATHRM + NUM_UNITS + GBA + FIREPLACES + 
##     QUALIFIED + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB, 
##     data = rp_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2549521  -156520    -6689   137777  7115773 
## 
## Coefficients:
##                   Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)  -70037411.126    699831.028  -100.1   <2e-16 ***
## BATHRM           31006.711      2484.506    12.5   <2e-16 ***
## NUM_UNITS       -91183.218      3541.307   -25.8   <2e-16 ***
## GBA                334.488         3.417    97.9   <2e-16 ***
## FIREPLACES       98561.721      2281.686    43.2   <2e-16 ***
## QUALIFIED      -107198.369      5270.501   -20.3   <2e-16 ***
## CENSUS_TRACT       -16.833         0.607   -27.7   <2e-16 ***
## SALE_YR          35007.629       350.331    99.9   <2e-16 ***
## SPAN_AYB          4301.970        86.016    50.0   <2e-16 ***
## SPAN_EYB        -10887.898       214.351   -50.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 322000 on 31425 degrees of freedom
## Multiple R-squared:  0.693,  Adjusted R-squared:  0.693 
## F-statistic: 7.87e+03 on 9 and 31425 DF,  p-value: <2e-16
vif(reg_rp_num)
##       BATHRM    NUM_UNITS          GBA   FIREPLACES    QUALIFIED 
##          2.3          1.2          2.5          1.5          1.0 
## CENSUS_TRACT      SALE_YR     SPAN_AYB     SPAN_EYB 
##          1.3          1.4          1.2          1.8
#rp_num$SALE_YR <- rp_num$SALE_YR - min(rp_num$SALE_YR) #find the reg. result is same
reg_rp_num_pg <- lm(PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
summary(reg_rp_num_pg)
## 
## Call:
## lm(formula = PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + 
##     FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp_num)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -499.6  -85.3  -11.2   70.7 1266.3 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -34904.4961    288.9141  -120.8   <2e-16 ***
## BEDRM          -30.0676      0.6904   -43.5   <2e-16 ***
## QUALIFIED      -72.0525      2.1160   -34.0   <2e-16 ***
## GRADE            5.7892      0.1933    29.9   <2e-16 ***
## CNDTN           10.2249      0.4062    25.2   <2e-16 ***
## FIREPLACES      27.9433      0.8573    32.6   <2e-16 ***
## WARD           -22.5467      0.4324   -52.1   <2e-16 ***
## SALE_YR         17.6536      0.1449   121.8   <2e-16 ***
## SPAN_AYB         1.9984      0.0352    56.7   <2e-16 ***
## SPAN_EYB        -4.0083      0.0881   -45.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129 on 31425 degrees of freedom
## Multiple R-squared:  0.567,  Adjusted R-squared:  0.567 
## F-statistic: 4.57e+03 on 9 and 31425 DF,  p-value: <2e-16
vif(reg_rp_num_pg)
##      BEDRM  QUALIFIED      GRADE      CNDTN FIREPLACES       WARD 
##        1.2        1.0        1.3        1.3        1.4        1.3 
##    SALE_YR   SPAN_AYB   SPAN_EYB 
##        1.5        1.2        1.9

2.6.2 Model with factor variables

The previous model is still based on the assumption of numeric variables. We just trivially try the same model in the raw variables. Although it may not be the best model, this model still has some reference value for analyzing the best linear model.
It should be noted that the best linear model for price per area fits the data better while the model for price fits the data worse. It is one of the reasons that we prefer to use the measurement of price per area here.
According the VIF values, all of these models have very low VIF values, which means it is not significant these models have the problem of multicolinearity.

reg_rp <- lm(PRICE ~ NUM_UNITS + GBA + FIREPLACES + CENSUS_TRACT + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
summary(reg_rp)
## 
## Call:
## lm(formula = PRICE ~ NUM_UNITS + GBA + FIREPLACES + CENSUS_TRACT + 
##     SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2611842  -158447    -4069   142338  7044929 
## 
## Coefficients:
##                   Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)  -74460583.841    673492.954  -110.6   <2e-16 ***
## NUM_UNITS       -82794.941      3477.628   -23.8   <2e-16 ***
## GBA                353.878         2.878   123.0   <2e-16 ***
## FIREPLACES      100988.815      2300.878    43.9   <2e-16 ***
## CENSUS_TRACT       -18.208         0.608   -30.0   <2e-16 ***
## SALE_YR          37188.383       337.015   110.3   <2e-16 ***
## SPAN_AYB          4325.679        86.803    49.8   <2e-16 ***
## SPAN_EYB        -11884.914       210.514   -56.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 325000 on 31427 degrees of freedom
## Multiple R-squared:  0.687,  Adjusted R-squared:  0.687 
## F-statistic: 9.85e+03 on 7 and 31427 DF,  p-value: <2e-16
vif(reg_rp)
##    NUM_UNITS          GBA   FIREPLACES CENSUS_TRACT      SALE_YR 
##          1.1          1.8          1.5          1.3          1.3 
##     SPAN_AYB     SPAN_EYB 
##          1.2          1.7
reg_rp_pg <- lm(PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
summary(reg_rp_pg)
## 
## Call:
## lm(formula = PRICE_GBA ~ BEDRM + QUALIFIED + GRADE + CNDTN + 
##     FIREPLACES + WARD + SALE_YR + SPAN_AYB + SPAN_EYB, data = rp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -535.2  -69.3   -7.5   61.1 1248.4 
## 
## Coefficients:
##                       Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)        -34494.7427    248.1859 -138.99  < 2e-16 ***
## BEDRM                 -24.9726      0.6075  -41.11  < 2e-16 ***
## QUALIFIEDU            -67.4641      1.7803  -37.89  < 2e-16 ***
## GRADEAverage          -17.5680      1.7035  -10.31  < 2e-16 ***
## GRADEExcellent         69.4358      3.8965   17.82  < 2e-16 ***
## GRADEExceptional-A    134.3369      6.5534   20.50  < 2e-16 ***
## GRADEExceptional-B    229.0489     10.8798   21.05  < 2e-16 ***
## GRADEExceptional-C    388.1131     21.2781   18.24  < 2e-16 ***
## GRADEExceptional-D    378.2852     27.2779   13.87  < 2e-16 ***
## GRADEFair Quality     -21.7109     18.9193   -1.15    0.251    
## GRADEGood Quality      23.9241      1.8993   12.60  < 2e-16 ***
## GRADESuperior         119.0405      4.3625   27.29  < 2e-16 ***
## GRADEVery Good         39.9269      2.7157   14.70  < 2e-16 ***
## CNDTNExcellent        213.1744     12.3316   17.29  < 2e-16 ***
## CNDTNFair              11.5871     13.1724    0.88    0.379    
## CNDTNGood              32.9560      1.5658   21.05  < 2e-16 ***
## CNDTNPoor              81.9009     44.1474    1.86    0.064 .  
## CNDTNVery Good        104.5957      2.3335   44.82  < 2e-16 ***
## FIREPLACES              4.5339      0.7697    5.89  3.9e-09 ***
## WARDWard 2            141.4621      3.4090   41.50  < 2e-16 ***
## WARDWard 3             63.1821      2.9670   21.30  < 2e-16 ***
## WARDWard 4            -56.6996      2.6304  -21.56  < 2e-16 ***
## WARDWard 5            -82.8398      2.7241  -30.41  < 2e-16 ***
## WARDWard 6             20.0013      2.4500    8.16  3.4e-16 ***
## WARDWard 7           -155.1005      3.1477  -49.27  < 2e-16 ***
## WARDWard 8           -201.1174      3.7956  -52.99  < 2e-16 ***
## SALE_YR                17.3763      0.1245  139.60  < 2e-16 ***
## SPAN_AYB                0.6616      0.0367   18.04  < 2e-16 ***
## SPAN_EYB               -0.8913      0.0850  -10.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108 on 31406 degrees of freedom
## Multiple R-squared:  0.696,  Adjusted R-squared:  0.695 
## F-statistic: 2.56e+03 on 28 and 31406 DF,  p-value: <2e-16
vif(reg_rp_pg)
##              BEDRM         QUALIFIEDU       GRADEAverage 
##                1.3                1.0                1.7 
##     GRADEExcellent GRADEExceptional-A GRADEExceptional-B 
##                1.5                1.4                1.1 
## GRADEExceptional-C GRADEExceptional-D  GRADEFair Quality 
##                1.0                1.1                1.0 
##  GRADEGood Quality      GRADESuperior     GRADEVery Good 
##                1.7                1.8                1.7 
##     CNDTNExcellent          CNDTNFair          CNDTNGood 
##                1.1                1.0                1.6 
##          CNDTNPoor     CNDTNVery Good         FIREPLACES 
##                1.0                1.7                1.6 
##         WARDWard 2         WARDWard 3         WARDWard 4 
##                2.4                3.0                2.6 
##         WARDWard 5         WARDWard 6         WARDWard 7 
##                2.5                2.8                2.4 
##         WARDWard 8            SALE_YR           SPAN_AYB 
##                1.6                1.5                1.9 
##           SPAN_EYB 
##                2.5

Chapter 3: SMART Questions

Our report has 4 aspects of SMART Questions: about floor plan, about location, about time related factors, about facilities.

3.1 SMART Questions about Floor Plan & Area

Generally, the floor plan is the most important element to attract home buyers. Therefore, it must be tightly related to the price. In this part, we will research the correlation between price and variables that are related to the floor plan.

3.1.1 How does floor plan affect price?

We filter three factors from cleaned data that are the number of bathrooms(BATHRM), the number of bedrooms(BEDRM) and the number of the kitchen(KITCHENS). According to the correlation plot, we will have an overview of the correlation between price and there three factors. Basically, it shows the bathroom tightly relates to price and the kitchen has the weakest correlation. We will figure out the reason in the next steps.

fp_corr <- cor(rp_num[c(1,6,19,8)])
corrplot(fp_corr, method = "square")

Since the number of kitchens looks weakly affected to price, we build two linear regression models to figure out the relationship between price and kitchen.

fpmd <- lm(PRICE ~ BATHRM + BEDRM + KITCHENS , data = rp_num)
summary(fpmd)
## 
## Call:
## lm(formula = PRICE ~ BATHRM + BEDRM + KITCHENS, data = rp_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2676540  -262056   -48967   189851 10001563 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -107405       9143   -11.8   <2e-16 ***
## BATHRM        290460       3139    92.5   <2e-16 ***
## BEDRM          50725       3112    16.3   <2e-16 ***
## KITCHENS     -123866       4855   -25.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 466000 on 31431 degrees of freedom
## Multiple R-squared:  0.356,  Adjusted R-squared:  0.356 
## F-statistic: 5.79e+03 on 3 and 31431 DF,  p-value: <2e-16
vif(fpmd)
##   BATHRM    BEDRM KITCHENS 
##      1.8      1.8      1.2

In this reference model, all of the coefficients are significant. Adjusted R-squared is 0.356 and vif test doesn’t show multicollinearity.

fpmdnokc <- lm(PRICE ~ BATHRM + BEDRM, data = rp_num)
summary(fpmdnokc)
## 
## Call:
## lm(formula = PRICE ~ BATHRM + BEDRM, data = rp_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2452447  -255578   -47343   192628  9989209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -181635       8757   -20.7   <2e-16 ***
## BATHRM        281157       3149    89.3   <2e-16 ***
## BEDRM          34721       3080    11.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471000 on 31432 degrees of freedom
## Multiple R-squared:  0.343,  Adjusted R-squared:  0.343 
## F-statistic: 8.2e+03 on 2 and 31432 DF,  p-value: <2e-16
vif(fpmdnokc)
## BATHRM  BEDRM 
##    1.7    1.7
boxplot(rp$PRICE~rp$BEDRM, col=topo.colors(20))

boxplot(rp$PRICE~rp$KITCHENS, col=topo.colors(6))

anova(fpmd, fpmdnokc)
## Analysis of Variance Table
## 
## Model 1: PRICE ~ BATHRM + BEDRM + KITCHENS
## Model 2: PRICE ~ BATHRM + BEDRM
##   Res.Df              RSS Df        Sum of Sq   F Pr(>F)    
## 1  31431 6828125428264322                                   
## 2  31432 6969530954644670 -1 -141405526380348 651 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing KITCHENS, coefficients are still significant, vif test also doesn’t illustrate multicollinearity. However, the Adjusted R-square decreases and the ANOVA test shows the importance of kitchen to the model. These means KITCHENS can improve the model but is not strongly related to price. I guess the reason is that the scale of the number of kitchens is narrow, thus the effect of the kitchen hardly reflects to price.

3.1.2 Find the price of your dream house.

In this part, we want to have some interaction between speaker and audiences in our presentation, so we would invite a volunteer to tell us how many bathrooms, bedrooms, and kitchens he/she want for his/her dream house in DC.

Price_filter <- rp_num[rp_num$BATHRM== 4 & rp_num$BEDRM==4 & rp_num$KITCHENS == 1,]

Based on the overview of properties in DC, we also want this volunteer to guess the price of his/her dream house. We use the t-test to show if he/she is a good guesser. For example, this audience set 300000 USD at an expected price. In our t-test, we assume the mean = 300000 as our null hypothesis. Then we could get the p-value to measure if his/her guessed number is close to the truth in 0.99 confidence level.

ttest99 = t.test(x=Price_filter$PRICE, mu=300000, conf.level=0.99 )
ttest99
## 
##  One Sample t-test
## 
## data:  Price_filter$PRICE
## t = 20, df = 300, p-value <2e-16
## alternative hypothesis: true mean is not equal to 300000
## 99 percent confidence interval:
##   940387 1118092
## sample estimates:
## mean of x 
##   1029240
ttest99$conf.int
## [1]  940387 1118092
## attr(,"conf.level")
## [1] 0.99
ttest99$alternative
## [1] "two.sided"
ttest99$estimate
## mean of x 
##   1029240

The box plot and summary indicate that even though people can purchase a house on the minimum price, they may not get a perfect one. Nobody wants to live in an extremely narrow house with a lot of rooms.Area is a key factor to measure a dream house(This is a bridge to transit to Area part).

boxplot(Price_filter$PRICE)

summary(Price_filter$PRICE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  212525  615000  920000 1029240 1320000 4300000

3.1.3 How do area and landarea influence price?

According to this scatter plot. Gross building area (GBA) is positively related to price. We color the scatter dots by ‘Landarea’. Since the color distributes irregularly. It reflects ‘Landarea’ may not affect the price. The sharp edges are due to data cleaning.

plot(rp_num$PRICE,rp_num$GBA,col=rp_num$LANDAREA, main = 'GBA-Price(Colored by Landarea)', xlab = 'Price', ylab = 'GBA')

plot(rp_num$PRICE,rp_num$LANDAREA, main = 'Landarea-Price', xlab = 'Price', ylab = 'Landarea')

We want to prove if land area is not related to price. Thus we first draw a scatter plot for land area and price. This plot doesn’t show any regular correlation between them. We also draw Q-Q plots for the GBA and Land area. These plots illustrate that are not normally distributed.

qqnorm(rp_num$LANDAREA, main = "Landarea Q-Q Plot") 
qqline(rp_num$LANDAREA)

qqnorm(rp_num$GBA, main = "GBA Q-Q Plot") 
qqline(rp_num$GBA)

area_corr <- cor(rp_num[c('LANDAREA','GBA','PRICE')])
corrplot(area_corr, method = "square")

The correlation plot shows it is related to price, but we finally want to use linear regression models and ANOVA test to determine their relationship. According to the ANOVA test, it’s easy to see the land area doesn’t affect price because the p-value is pretty large when we add land area into the model. It means Model 1 and Model 2 are the same. Also, the R-Squares in lm_GBA and lm_GBA_LANDAREA are almost same but lmLANDAREA is pretty small, which support us to make a conclusion: the land area is not a key element to change the price.

lm_GBA_LANDAREA <- lm(formula = PRICE ~ LANDAREA + GBA , data = rp_num)
lm_GBA <- lm(formula = PRICE ~ GBA , data = rp_num)
lm_LANDAREA<- lm(formula = PRICE ~ LANDAREA, data = rp_num)
#summary(lm_GBA)
#summary(lm_GBA_LANDAREA)
#summary(lm_LANDAREA)


anova(lm_GBA, lm_GBA_LANDAREA, lm_LANDAREA)
## Analysis of Variance Table
## 
## Model 1: PRICE ~ GBA
## Model 2: PRICE ~ LANDAREA + GBA
## Model 3: PRICE ~ LANDAREA
##   Res.Df              RSS Df         Sum of Sq        F Pr(>F)    
## 1  31433 5667532724398895                                         
## 2  31432 5667527409628765  1        5314770130     0.03   0.86    
## 3  31433 9147628459924014 -1 -3480101050295249 19300.57 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.2 SMART Questions about Location

Let’s study the affect of location on prices. DC has different wards. Let’s see if prices are equally distributed among all wards.

loadPkg("ggmap")
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
data <- read.csv("./data/DC_Properties.csv")
data <- data[!is.na(data$PRICE),]    #removing rows with no price

# Location/Ward importance to the price?
wards <- vector()
q1_wards <- vector()
q3_wards <- vector()
max_wards <- vector()
avg_wards <- vector()
#names(wards) <- list(unique(data$WARD))

for (i in sort((unique(data$WARD)))){
  max_wards[[i]] <- max(subset(data,WARD==i)$PRICE)
  avg_wards[[i]] <- mean(subset(data,WARD==i)$PRICE)
  wards[[i]] <- list(sort(list(subset(data,WARD==i)$PRICE)[[1]]))
  q1_wards[[i]] <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[2]]
  q3_wards[[i]] <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[4]]
}

barplot(avg_wards)
lines(q1_wards)
lines(q3_wards)

Ward location does affect house pricing. Houses in ward 3 are obviously more expensive than the rest. Let’s check if any other location based factors affect the prices too. Each house is on a different floor. We should see if floors affect the pricing just like wards.

stor <- vector()
avg_stor <- vector()

for (i in (unique(data$STYLE))){
  avg_stor[[i]] <- mean(subset(data,STYLE==i)$PRICE)
  stor[[i]] <- list(sort(list(subset(data,STYLE==i)$PRICE)[[1]]))
  
}
barplot(order(unlist(avg_stor)), horiz = FALSE, names.arg=sort(names(avg_stor)))

Relatively, there is no specific floor which gives the house a higher price like ward does. Let’s analyse each and every ward of DC. Diving the house prices of each ward into quantiles and then visualising will tell us if the distrubution is truly location based or some other factors are affecting the prices.

register_google(key = "AIzaSyAhGGV1_J9ipBAsF6vE7fg56zjDy_uaCvA")   #to be kept private
#plot all wards to see ward location
qmap_name <- "Washington dc"
dc <-qmap(qmap_name, zoom=12,maptype = "terrain")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Washington%20dc&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Washington+dc&key=xxx
data2 <- data[!is.na(data$WARD),]

dc <- dc + geom_point(aes(x = LONGITUDE, y = LATITUDE, colour = WARD),data = data2) + ggtitle("DC WARDS")

print(dc)
## Warning: Removed 1666 rows containing missing values (geom_point).

wards <- vector()

for (i in sort((unique(data$WARD)))){
  cat(i)
  wards[[i]] <- list(sort(list(subset(data,WARD==i)$PRICE)[[1]]))
  q1 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[2]]
  q2 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[3]]
  q3 <- quantile(wards[[i]][[1]], seq(from = 0, to = 1, by = 0.25))[[4]]
  
  data2 <- subset(data,WARD==i)
  
  data2$Quantile[data2$PRICE > q3] <- "4"
  data2$Quantile[data2$PRICE < q3] <- "3"
  data2$Quantile[data2$PRICE < q2] <- "2"
  data2$Quantile[data2$PRICE < q1] <- "1"
  data2 <- data2[!is.na(data2$Quantile),] 
  qmap_name <- paste(c(i ,", washington dc"), collapse = "")
  dc <-qmap(qmap_name, zoom=15)
  dc <- dc + geom_point(aes(x = LONGITUDE, y = LATITUDE, colour = Quantile),data = data2) + ggtitle(i)
  
  #png(filename= paste(c(i,".png"), collapse = ""))
  print(dc)
  #dev.off()
}
## Ward 1
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%201,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+1,+washington+dc&key=xxx
## Warning: Removed 12446 rows containing missing values (geom_point).
## Ward 2
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%202,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+2,+washington+dc&key=xxx

## Warning: Removed 9683 rows containing missing values (geom_point).
## Ward 3
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%203,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+3,+washington+dc&key=xxx

## Warning: Removed 11645 rows containing missing values (geom_point).
## Ward 4
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%204,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+4,+washington+dc&key=xxx
## "Ward 4, washingto..." not uniquely geocoded, using "ward ct nw, washington, dc 20036, usa"

## Warning: Removed 11969 rows containing missing values (geom_point).
## Ward 5
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%205,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+5,+washington+dc&key=xxx

## Warning: Removed 9730 rows containing missing values (geom_point).
## Ward 6
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%206,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+6,+washington+dc&key=xxx

## Warning: Removed 13506 rows containing missing values (geom_point).
## Ward 7
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%207,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+7,+washington+dc&key=xxx
## "Ward 7, washingto..." not uniquely geocoded, using "4645 nannie helen burroughs ave ne, washington, dc 20019, usa"

## Warning: Removed 6438 rows containing missing values (geom_point).
## Ward 8
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Ward%208,%20washington%20dc&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Ward+8,+washington+dc&key=xxx

## Warning: Removed 4740 rows containing missing values (geom_point).

## Ward 2: Some of the most expensive houses are locaated in this ward. Almost all houses have high range, which means average value of houses in general might be less due to some houses in this ward due to other factors but overall the price is high. ## Ward 3: Almost all houses are expensive. Any house in this ward will have a high price. This means prices here are affected more by the location than any other factor. ## Ward 5, 7 and 8: These areas in DC have equal distribution from affordable houses to expensive ones. In such cases, location plays minor role and other factors such as land area are more important.

The other factors mentioned could be time. Which is why time analysis is important. Sell date of a house coupled with inflation could tell us a lot more. Let’s analyse price on basis of when the house was sold.

get_year <- function(x){
  return(as.numeric(substring(x,0,4)))
}
sold_price <- vector()
sold_num <- vector()
data2 <- data[!is.na(data$SALEDATE),]
data2$sell_year <- get_year(data2$SALEDATE)
data2 <- data2[!is.na(data2$sell_year),]
data2 <- subset(data2,sell_year>1990)
#sapply(data2$sell_year, numeric)
for (i in sort(unique(data2$sell_year))){
  data3 <- subset(data2,sell_year==i)
  sold_num[[as.character(i)]] <- nrow(data3)
  sold_price[[as.character(i)]] <- mean(data3$PRICE)
}

plot(sold_num, type = "s", xaxt="n")
options(scipen=20)
axis(1,at=1:length(unique(data2$sell_year)),labels=names(sold_num))

(Further analysis of time/sell year in 3.3.2)

3.4 SMART Questions about Facilities

In this part, we explored the relationship between the internal facilities and the price, mainly focusing on two parts: the heating and cooling facilites. By the way, in our dataset, AC (the air conditioner) represents the cooling facilities and the heat represents the heating facilities.

First, since the variables of AC and HEAT are characters, we use ‘table()’ function to count the frequency of different types in the two variables. As shown below, HEAT has 14 levels indicating 14 types of heaters while AC has only 2 levels indicating whether propeties have or don’t have the air conditioners.

names(rp)
##  [1] "BATHRM"             "HEAT"               "AC"                
##  [4] "NUM_UNITS"          "ROOMS"              "BEDRM"             
##  [7] "STORIES"            "PRICE"              "QUALIFIED"         
## [10] "SALE_NUM"           "GBA"                "STYLE"             
## [13] "STRUCT"             "GRADE"              "CNDTN"             
## [16] "EXTWALL"            "ROOF"               "INTWALL"           
## [19] "KITCHENS"           "FIREPLACES"         "LANDAREA"          
## [22] "ZIPCODE"            "ASSESSMENT_NBHD"    "ASSESSMENT_SUBNBHD"
## [25] "CENSUS_TRACT"       "WARD"               "QUADRANT"          
## [28] "SALE_YR"            "SPAN_YR_RMDL"       "SPAN_AYB"          
## [31] "SPAN_EYB"           "PRICE_GBA"
temp_price<- rp[c(2,3,8,12)]
temp_price_n<- rp_num[c(2,3,8,12)]

temp_price[is.null(temp_price)]<-"0"
str(temp_price)
## 'data.frame':    31435 obs. of  4 variables:
##  $ HEAT : Factor w/ 14 levels "Air Exchng","Air-Oil",..: 13 8 8 8 13 13 13 8 6 6 ...
##  $ AC   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PRICE: num  1095000 2100000 1602000 1050000 1430000 ...
##  $ STYLE: Factor w/ 19 levels "","1 Story","1.5 Story Fin",..: 8 8 8 8 5 5 5 5 5 5 ...
table(temp_price$HEAT)
## 
##     Air Exchng        Air-Oil  Elec Base Brd   Electric Rad       Evp Cool 
##              6             16             31             16              8 
##     Forced Air Gravity Furnac  Hot Water Rad        Ht Pump       Ind Unit 
##          13685             16           9529            554              4 
##        No Data   Wall Furnace      Warm Cool Water Base Brd 
##              0             32           7479             59
table(temp_price$AC)
## 
##     N     Y 
##  4563 26872
length(table(temp_price$HEAT))
## [1] 14
length(table(temp_price$STYLE))
## [1] 19

We draw a correlation matrix of the sub-dataset to show the correlation between cooling facilities, heating facilites and the properties’price.It??s very clear the price of shows a relative high correlation with the heating and cooling facilities.

temp_price_corr <- cor(rp_num[c('HEAT','AC','PRICE')])
corrplot(temp_price_corr, method = "square")

3.4.1 SMART Question: Does heating facilities have effect on price?

First, from the aspect of heating facilites, we use bar chart to reflect the using frequency of different types of heaters. We can see forced air, hot water rad and warm cool are used most.

HEAT_freq<-as.data.frame(table(temp_price$HEAT))
names(HEAT_freq)<-c("Types","Frequence")

ggplot(data=HEAT_freq, mapping=aes(x=Types,y=Frequence))+
geom_bar(stat="identity",size=50,fill="#D55E00")+coord_flip()

When it comes to the relationship between heating facilities and price, we use the Analysis of Variance (ANOVA). It’s clear that the p-value is quite small, we can say the heating facilities do have effects on the price.

HEAT <- subset(rp, select=c('PRICE','HEAT'))
aov_p_h <- aov(PRICE~HEAT, data=HEAT)
aov_p_h 
## Call:
##    aov(formula = PRICE ~ HEAT, data = HEAT)
## 
## Terms:
##                              HEAT         Residuals
## Sum of Squares    545246686635119 10059099190935380
## Deg. of Freedom                12             31422
## 
## Residual standard error: 565800
## Estimated effects may be unbalanced
summary(aov_p_h)
##                Df            Sum Sq        Mean Sq F value
## HEAT           12   545246686635119 45437223886260     142
## Residuals   31422 10059099190935380   320129183086        
##                          Pr(>F)    
## HEAT        <0.0000000000000002 ***
## Residuals                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since there are so many types of heaters, we may wonder whether differen types of heaters have different effects on the price.From boxplot, we can see the price of some types shows a larger or smaller scope such as Air Exchng and Evp Cool, but the price of the three types used most-forced air, hot water rad and warm cool- show the similar medium and scope, which indicate people may not really care about what types of heaters in your house as long as it is warm enough.

plot(PRICE~HEAT, data=HEAT,col=topo.colors(20))

3.4.2 SMART Question:Does cooling facilities have effect on price?

As for cooling systems, we only have information of yes or no. As the bar chart shows, most residiential properties do have the cooling systems(air conditioners).

AC_freq<-as.data.frame(table(temp_price$AC))
names(AC_freq)<-c("YN","Frequence")

ggplot(data=AC_freq, mapping=aes(x=YN,y=Frequence))+
geom_bar(stat="identity",width=0.4,fill="#D55E00")

Boxplot the price with or without AC(air conditioner).

We use the boxplot to compare the price when properties with or without cooling systems(air conditioners). As shown below, we can see the price shows a whole ascending trend from no to yes, which can be concluded that the cooling systems(air conditioners) do have effects on the price of residitial properties.

plot(x=rp$AC,y=rp$PRICE,log="y")

To prove the relationship between price and cooling facilities, We also use the Analysis of Variance (ANOVA), the p-value is quite small, we can say the cooling facilities(air conditioners) do have significant effects on price as well.

AC = subset(rp, PRICE>0, select=c('PRICE','AC'))
aov_p_c = aov(PRICE~AC, data=AC)
aov_p_c 
## Call:
##    aov(formula = PRICE ~ AC, data = AC)
## 
## Terms:
##                                AC         Residuals
## Sum of Squares    296500103900770 10307845773669598
## Deg. of Freedom                 1             31433
## 
## Residual standard error: 572652
## Estimated effects may be unbalanced
summary(aov_p_c)
##                Df            Sum Sq         Mean Sq F value
## AC              1   296500103900770 296500103900770     904
## Residuals   31433 10307845773669598    327930702563        
##                          Pr(>F)    
## AC          <0.0000000000000002 ***
## Residuals                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Furthermore, we use T-test to prove it. We set the null hypothesis as whether properties have or have no air conditioners, the means of the price are equal.And we set the alternative hypothesis as when properties have or have no air conditioners , the means of the price equal are not equal.Besides, we construct t-intervals at 0.999 level. As the result shows below, the P-value is less than 0.001, so we reject that the price of properties with or without have the same mean values. So air conditioners do have effects on price.

AC_Y<-subset(rp_num,AC==2)
AC_N<-subset(rp_num,AC==1)
t.test(AC_Y$PRICE, mu = mean(AC_N$PRICE), conf.level = 0.99)
## 
##  One Sample t-test
## 
## data:  AC_Y$PRICE
## t = 70, df = 30000, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 471795
## 99 percent confidence interval:
##  737982 757017
## sample estimates:
## mean of x 
##    747500

Chapter 4: Conclusion

In conclusion, we realize that data cleaning is a very important step for data analysis. Before our SMART questions, we’ve spent a lot time in data preprocessing and cleaning. If we take a look at this clean-out data, they are very unhealthy and may lead our analysis to the wrong directions. For the conlusion of SMART Questions: we find gross building area is a significant factor to decide the price of an property; It is significant that certain areas (ward 2&3) have higher housing price in DC; growth rate of the housing price is twice than the inflation rate during the recent 25 years, nevertheless, it is still a steady rate; As for the facilities in the properties, both cooling and heating affect price, but types of heating do not matter to the price.